Abstract


We describe our four-camera multibaseline stereo system in a convergent configuration and
our implementation of a parallel depth recovery scheme for this system. Our system is capable
of image capture at video rate. This is critical in applications that require three-dimensional
tracking. We obtain dense stereo depth data by projecting a light pattern of frequency-modulated,
sinusoidally varying intensity onto the scene, thus increasing the local discriminability
at each pixel and facilitating matches. In addition, we make the most of the camera
view areas by converging them at a volume of interest. Results indicate that we are able to
extract stereo depth data that are, on average, less than 1 mm in error at distances
between 1.5 and 3.5 m from the cameras.
1 Introduction

Binocular stereo vision is a simple and flexible method by which three-dimensional
(range) information of a scene can be obtained. Therefore, it is not surprising to find that
stereo is a very active area of research [2]. The geometrical issues in stereo have also been
well explored [6]. The primary drawback of stereo is the problem of image point correspondence
(for a survey of correspondence techniques, see [5]). The trade-off between
accuracy (which is aided by a wide baseline, or separation between the cameras) and ease
of correspondence (which is simpler with a narrow baseline) has been mitigated using
multiple cameras or camera locations. Such an approach has been termed multibaseline
stereo [12].

Stereo vision is computationally intensive. Fortunately, the spatially repetitive nature of
depth recovery lends itself to parallelization. This is especially critical in the case of
multibaseline stereo with high image resolution and the practical requirement of timely
extraction of data. A number of researchers have worked on fast implementation of stereo (e.g.,
[11], [13], [14]).

In this report, we describe our implementation of a depth recovery scheme on
iWarp for a four-camera multibaseline stereo system in a convergent configuration. Our system is
capable of image capture at video rate. This is critical in applications that require tracking
in three dimensions (an example is [10]). One method to obtain dense stereo depth data is
to interpolate between reliable pixel matches [8]. However, the interpolated values may
not be accurate. We obtain accurate dense depth data by projecting a light pattern of
sinusoidally varying intensity onto the scene, thus increasing the local discriminability at each
pixel. In addition, we make the most of the camera view areas by converging them at a
volume of interest. Experiments have indicated that we are able to extract stereo depth
data that are, on average, less than 1 mm in error at distances between 1.5 and 3.5 m
from the cameras.

We introduce the notion of active multibaseline stereo for extraction of dense stereo
range data in Section 2. The principle of multibaseline stereo is explained, and in addition,
we justify our use of the camera system in a convergent configuration. In the same section, we
briefly describe our image acquisition system, which enables us to capture intensity images at
video rate (30 Hz). Before the camera system can be used, it must be calibrated; this
procedure is described in Section 3.

Prior to depth recovery, we apply a warping operation called image rectification to the set
of images as a preprocessing step for computational reasons; this warping operation is
described in Section 4. Our implementation of the depth recovery algorithm is
subsequently detailed in the same section.

Finally,  we  present  results  of  our  experiments  in  Section  5,  analyze  the  sources  of  error  in 
our  system  in  Section  6,  and  summarize  our  work  in  Section  7. 
2 The active 4-camera system

Our  multibaseline  camera  system  is  shown  in  Fig.  1.  It  comprises  four  cameras  mounted  on 
a  plain  metal  bar,  which  in  turn  is  mounted  on  a  sturdy  tripod  stand;  each  camera  can  be 
rotated  about  a  vertical  axis  and  fixed  at  discrete  positions  along  the  bar.  The  four  camera 
video  signals  are  all  synchronized  by  ganging  the  genlock  signals. 


Fig. 1 The 4-camera system


In addition to the cameras, we use a projector to cast a pattern of sinusoidally varying intensity
(active lighting) onto the scene. This notion of active multibaseline stereo allows a denser
depth map as a result of improved local scene discrimination and hence correspondence.

2.1  The  principle  of  multibaseline  stereo 

In binocular stereo where the two camera axes are parallel, depth can easily be calculated
given the disparity (the shift in position for corresponding points between the images). If the
focal length of both cameras is $f$, the baseline $b$ and the disparity $d$, then the depth $z$ is given by
$z = f b / d$ (Fig. 2).

In multibaseline stereo, more than two cameras or camera locations are employed, yielding
multiple images with different baselines [12]. In the parallel configuration, each camera is a
lateral displacement of the other. From Fig. 2, $d = f b / z$ (we assume for illustration that
the cameras have identical focal lengths).

For a given depth, we then calculate the respective expected disparities relative to a
reference camera (say, the leftmost camera) as well as the sum of match errors over all the
cameras. (An example of a match error is the image difference of image patches centered at
corresponding points.) By iterating the calculations over a given resolution and interval of
depths, the depth associated with a given pixel in the reference camera is taken to be the one
with the lowest error.
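
The depth search described above can be sketched in a few lines. This is a minimal illustration only, not the iWarp implementation: it assumes parallel cameras, 1-D scanlines, a squared-difference match error summed over a small window, and linear interpolation for sub-pixel shifts. The names `multibaseline_depth`, `texture`, and all numeric values are hypothetical.

```python
import numpy as np

def multibaseline_depth(ref, others, baselines, focal, depths, half_win=2):
    """For each pixel of a 1-D reference scanline, return the candidate depth
    whose summed windowed match error over all other cameras is lowest."""
    cols = np.arange(ref.size, dtype=float)
    window = np.ones(2 * half_win + 1)
    errors = np.zeros((len(depths), ref.size))
    for k, z in enumerate(depths):
        for img, b in zip(others, baselines):
            d = focal * b / z                          # expected disparity, d = f b / z
            shifted = np.interp(cols - d, cols, img)   # resample camera image at x - d
            errors[k] += np.convolve((ref - shifted) ** 2, window, mode="same")
    return np.asarray(depths)[np.argmin(errors, axis=0)]

# Synthetic check: a textured scanline observed at a true depth of 2.0 m.
def texture(x):
    return np.sin(0.31 * x) + 0.5 * np.sin(0.73 * x + 1.0) + 0.25 * np.sin(1.57 * x)

focal, true_z = 500.0, 2.0        # focal length in pixels, depth in metres (assumed values)
baselines = [0.1, 0.2, 0.3]       # metres, measured from the reference camera
x = np.arange(400, dtype=float)
ref = texture(x)
others = [texture(x + focal * b / true_z) for b in baselines]
depths = np.linspace(1.5, 3.5, 41)
est = multibaseline_depth(ref, others, baselines, focal, depths)
```

Pixels close to the image border are unreliable in this sketch because the shifted samples fall outside the scanline; only interior pixels should be compared against the true depth.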
Fig. 2 Relationship between the baseline b, disparity d, focal length f, and depth z

The multibaseline approach has the advantage of reducing mismatches during correspondence
because multiple baselines are used simultaneously. In addition, it produces a statistically
more accurate depth value [12]. However, using multiple cameras alone does not solve the
problem of matching ambiguity that occurs with smooth untextured object surfaces in the
scene. This is why the idea of using active lighting in the form of a projected pattern
on the scene is important. The projected pattern on object surfaces in the scene helps in
disambiguating local matches in the camera images.

2.2  Why  use  a  verged  camera  configuration? 

The primary problem associated with a stereo arrangement of parallel camera locations is
the limited overlap between the fields of view of all the cameras. The percentage of overlap
increases with depth. The primary advantage is the simple and direct formula for extracting
depth.

The parallel camera configuration is suitable for outdoor applications where speed matters
more than accuracy (e.g., [13]). A problem with this configuration is the
low percentage of overlap in the fields of view of the cameras.

Verging the cameras at a specific volume in space is optimal in an indoor application where
maximum utility of the camera visual range is desired and the workspace size is constrained
and known a priori. Such a configuration is illustrated in Fig. 3. One such application is the
tracking of objects in the Assembly Plan from Observation project [9]. The aim of the
project is to enable a robot system to observe a human perform a task, understand the task, and
replicate that task using a robotic manipulator. By continuously monitoring the human hand
motion, motion breakpoints such as the point of grasping and ungrasping an object can be
extracted [10]. The verged multibaseline camera system can extend the capability of the
system to tracking the object being manipulated by the human. For this purpose, we require fast
image acquisition (though processing is not as critical) and accurate depth recovery.

Fig. 3 A verged camera configuration (dark shaded area is the common 3D space viewable
from all cameras).


2.3 Video-rate image acquisition system

Our image acquisition system consists of the physical camera setup described earlier in this
section, the video interface board, and the 8x8 matrix of iWarp cells (Fig. 4). Each iWarp
component contains a 20 MFLOPS computation engine and a low-latency (100-150 ns)
communication engine for interfacing with other iWarp cells [3]. The existing iWarp system is
an 8x8 torus of iWarp cells, half of which have 16 MB DRAMs per cell. The video interface,
which is described in detail elsewhere [17], is connected directly to the iWarp cell
through the memory interface; the digitized video data is routed and distributed at video rate
to the DRAMs by taking advantage of iWarp’s systolic design [4].


3  Camera  calibration 

Before data images can be taken and the scene depth recovered, we must first calibrate the
camera configuration. Calibrating the camera configuration refers to the determination of
the extrinsic (relative pose) and intrinsic (optic center offset, focal length and aspect ratio)
camera parameters. The pinhole camera model is assumed in the calibration process. The
origin of the verged camera configuration coincides with that of the leftmost camera.

A printed planar dot pattern arranged in a 7x7 equally spaced grid is used in calibrating the
cameras; images of this pattern are taken at known depth positions (five in our case). An
example set of images taken by the camera system is shown in Fig. 5.
Fig. 4 Block diagram of the image acquisition system. The shaded boxes labeled “M” indicate the
16 MB DRAMs connected to local iWarp cells, while the shaded box labeled “VI” refers to the video
interface connected to one of the iWarp cells.
Fig. 5 Calibration images (equalized) taken from the convergent camera configuration ((a)-(d))


The dots of the calibration pattern are detected using a star-shaped template with the weight
distribution decreasing towards the center. The entire pattern is extracted and tracked from
one camera to the next by imposing structural constraints on each dot relative to its
neighbors, namely by determining the nearest and second nearest distances to another dot. This
filters out wrong dot candidates, as shown in Fig. 6.

The simultaneous recovery of the camera parameters of all four cameras can be done using
the non-linear least-squares technique described by Szeliski and Kang [16]. The inputs and
outputs to this module are shown in the simplified diagram in Fig. 7. An alternative would
be to use the pairwise-stereo calibration approach proposed by Faugeras and Toscani [7].
Fig. 6 Detecting and tracking the calibration points (only part of the image associated with Camera 1
is shown). The black +’s are the detected points while the white +’s are the spurious and rejected
points: (a) Points detected in image of Camera 1; (b) Points detected in images of Cameras 1 and 2; (c)
Points detected in images of Cameras 1, 2, and 3; (d) Points detected in all images.
Fig. 7 Non-linear least-squares approach to extraction of camera parameters


4  Image  rectification  and  depth  recovery 

If two camera axes are not parallel, their associated epipolar lines are not parallel to the scan
lines. This introduces extra computation to extract depth from stereo. To simplify and
reduce the amount of computation, rectification can be carried out first. The process of
rectification for a pair of images (given the camera parameters, either through direct or weak
calibration) transforms the original pair of image planes to another pair such that the resulting
epipolar lines are parallel and equal along the new scan lines. Rectification is depicted in
Fig. 8. Here $\mathbf{c}_1$ and $\mathbf{c}_2$ are the camera optical centers, $\Pi_1$ and $\Pi_2$ the original image planes,
and $\Omega_1$ and $\Omega_2$ the rectified image planes. The condition of parallel and equal epipolar lines
necessitates that planes $\Omega_1$ and $\Omega_2$ lie in the same plane, indicated as $\Omega_{12}$. A point $\mathbf{q}$ is
projected to image points $\mathbf{v}_1$ and $\mathbf{v}_2$ on the same scan line in the rectified planes.
A simple rectification method is described in [1]. However, the rectification process
described there is a direct function of the locations of the camera optical centers. It is not
apparent how the desirable properties of minimal distortion and maximal inclusion can be
achieved with their formalism. We have modified their formalism to simplify the
rectification mapping and adapt it to our situation.

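Let the original 3x4 perspective transforms of two cameras be $\mathbf{P}_1$ and $\mathbf{P}_2$, where

\[
\mathbf{P}_j =
\begin{bmatrix}
\mathbf{p}_{j1}^T & p_{j14} \\
\mathbf{p}_{j2}^T & p_{j24} \\
\mathbf{p}_{j3}^T & p_{j34}
\end{bmatrix}
\]

The original perspective transform $\mathbf{P}_j$ is constructed from known camera parameters of the
form

\[
\tilde{\mathbf{u}}_j =
\begin{bmatrix} f_j & 0 & 0 \\ 0 & a_j f_j & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{R}_j & \mathbf{t}_j \end{bmatrix} \tilde{\mathbf{q}}
=
\begin{bmatrix} f_j & 0 & 0 \\ 0 & a_j f_j & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix}
\mathbf{r}_{j1}^T & t_{jx} \\
\mathbf{r}_{j2}^T & t_{jy} \\
\mathbf{r}_{j3}^T & t_{jz}
\end{bmatrix} \tilde{\mathbf{q}}
=
\begin{bmatrix}
f_j \mathbf{r}_{j1}^T & f_j t_{jx} \\
a_j f_j \mathbf{r}_{j2}^T & a_j f_j t_{jy} \\
\mathbf{r}_{j3}^T & t_{jz}
\end{bmatrix} \tilde{\mathbf{q}}
= \mathbf{P}_j \tilde{\mathbf{q}}
\]

where the tilde (~) above the vector indicates its homogeneous representation, $\mathbf{q}$ is the 3D
point, $\mathbf{u}_j$ the image coordinate vector, $f_j$ the focal length, $a_j$ the aspect ratio, and $\mathbf{R}_j$ and $\mathbf{t}_j$ the
extrinsic camera parameters. It is easy to see that the camera axis vector is $\mathbf{r}_{j3}$, and in the
camera image coordinate system, the x- and y-directions are along $\mathbf{r}_{j1}$ and $\mathbf{r}_{j2}$, respectively.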
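Also, let $\mathbf{M}$ and $\mathbf{N}$ be the rectified perspective transforms of the two cameras, respectively, where

\[
\mathbf{M} =
\begin{bmatrix}
\mathbf{m}_1^T & m_{14} \\
\mathbf{m}_2^T & m_{24} \\
\mathbf{m}_3^T & m_{34}
\end{bmatrix}
\quad \text{and} \quad
\mathbf{N} =
\begin{bmatrix}
\mathbf{n}_1^T & n_{14} \\
\mathbf{n}_2^T & n_{24} \\
\mathbf{n}_3^T & n_{34}
\end{bmatrix}
\]

Since perspective matrices are defined up to a scale factor, we can set both $m_{34}$ and $n_{34}$ to be
unity. Accordingly, based on the analysis in [1], $\mathbf{m}_3 = \mathbf{n}_3$, $\mathbf{m}_2 = \mathbf{n}_2$, $m_{24} = n_{24}$, and from the
constraint that $\mathbf{c}_1$ and $\mathbf{c}_2$ remain the optical centers,

\[
\begin{aligned}
\mathbf{m}_1^T \mathbf{c}_1 + m_{14} &= 0 \\
\mathbf{m}_2^T \mathbf{c}_1 + m_{24} &= 0 \\
\mathbf{m}_2^T \mathbf{c}_2 + m_{24} &= 0 \\
\mathbf{n}_1^T \mathbf{c}_2 + n_{14} &= 0 \\
\mathbf{m}_3^T \mathbf{c}_1 + 1 &= 0 \\
\mathbf{m}_3^T \mathbf{c}_2 + 1 &= 0
\end{aligned}
\]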

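Let $\mathbf{d}_{12} = \mathbf{c}_1 - \mathbf{c}_2$. In a departure from [1], we choose the common rectified camera axis
direction not only to be perpendicular to $\mathbf{d}_{12}$, but also to point in the direction between those
of the unrectified camera axes (i.e., $\mathbf{r}_{13}$ and $\mathbf{r}_{23}$). This is done by first calculating

\[
\mathbf{g} = \mathbf{r}_{13} + \mathbf{r}_{23}
\]

We then find the nearest vector perpendicular to $\mathbf{d}_{12}$:

\[
\mathbf{g}' = \mathbf{g} - \frac{\mathbf{g}^T \mathbf{d}_{12}}{\|\mathbf{d}_{12}\|^2} \, \mathbf{d}_{12}
\]

Thus,

\[
\mathbf{m}_3 = \mathbf{n}_3 = -\frac{\mathbf{g}'}{\mathbf{g}'^T \mathbf{c}_1} = -\frac{\mathbf{g}'}{\mathbf{g}'^T \mathbf{c}_2}
\]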

Determining  m2  (and  hence  m24)  is  similar,  with  the  additional  constraint  that 


mj  = 


‘1/ 


Finally, $\mathbf{m}_1$ is determined from the relation

\[
\mathbf{m}_1 = \lambda (\mathbf{m}_2 \times \mathbf{m}_3)
\]

$\lambda$ (and hence $\mathbf{m}_1$ and $m_{14}$) is calculated based on the constraint


$\mathbf{n}_1$ and $n_{14}$ are calculated in the same way, using the counterpart values of $\mathbf{P}_2$.


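As in [1], the homographies (or linear projective correspondences) that map the unrectified
image coordinates to the rectified image coordinates are

\[
\mathbf{H}_1 =
\begin{bmatrix} \mathbf{m}_1^T \\ \mathbf{m}_2^T \\ \mathbf{m}_3^T \end{bmatrix}
\begin{bmatrix}
(\mathbf{p}_{12} \times \mathbf{p}_{13}) & (\mathbf{p}_{13} \times \mathbf{p}_{11}) & (\mathbf{p}_{11} \times \mathbf{p}_{12})
\end{bmatrix}
\]

where

\[
\mathbf{v}_1 = \mathbf{H}_1 \mathbf{u}_1
\]

$\mathbf{u}_1$ and $\mathbf{v}_1$ are the homogeneous unrectified and rectified image coordinates, respectively,
and

\[
\mathbf{H}_2 =
\begin{bmatrix} \mathbf{n}_1^T \\ \mathbf{n}_2^T \\ \mathbf{n}_3^T \end{bmatrix}
\begin{bmatrix}
(\mathbf{p}_{22} \times \mathbf{p}_{23}) & (\mathbf{p}_{23} \times \mathbf{p}_{21}) & (\mathbf{p}_{21} \times \mathbf{p}_{22})
\end{bmatrix}
\]

with

\[
\mathbf{v}_2 = \mathbf{H}_2 \mathbf{u}_2
\]

$\mathbf{u}_2$ and $\mathbf{v}_2$ similarly defined.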
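
The cross-product form of these homographies can be checked numerically. The sketch below builds the homography from the left 3x3 blocks of a rectified and an unrectified perspective transform that share an optical center, then verifies that it carries the unrectified projection of a 3D point to its rectified projection. The function name, the matrices, and the test point are hypothetical values chosen only for illustration.

```python
import numpy as np

def rectifying_homography(M, P):
    """Homography from unrectified to rectified homogeneous image coordinates,
    for 3x4 transforms M (rectified) and P (original) with a common optical center."""
    p1, p2, p3 = P[:, :3]                     # rows of the left 3x3 block of P
    cross_block = np.column_stack(
        [np.cross(p2, p3), np.cross(p3, p1), np.cross(p1, p2)]
    )                                         # equals det(P3) * inverse(P3)
    return M[:, :3] @ cross_block

# Hypothetical transforms sharing the optical center c (so that A c + a4 = 0).
c = np.array([0.3, -0.2, -1.5])
A_P = np.array([[2.0, 0.1, 0.3], [0.0, 1.8, 0.2], [0.1, 0.0, 1.0]])
A_M = np.array([[1.9, 0.0, 0.5], [0.05, 1.7, 0.1], [0.0, 0.1, 1.0]])
P = np.column_stack([A_P, -A_P @ c])
M = np.column_stack([A_M, -A_M @ c])

q = np.array([0.4, 0.1, 2.0, 1.0])            # homogeneous 3D point
u, v = P @ q, M @ q                           # unrectified and rectified projections
w = rectifying_homography(M, P) @ u           # should equal v up to scale
```

The matrix of cross products is the adjugate of the left 3x3 block of the original transform, so the result differs from the product with the true inverse only by a scale factor, which is immaterial for homogeneous coordinates.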

To  recover  depth  from  multibaseline  stereo  (specifically  a  4-camera  system)  in  a  convergent 
configuration,  we  first  rectify  pairs  of  images  as  shown  in  Fig.  9. 

There are two schemes which allow us to recover depth. The first uses all the homographies
between the unrectified images and rectified images (namely $\mathbf{H}_{11}$, $\mathbf{H}_{12}$, $\mathbf{H}_{13}$, $\mathbf{H}_{21}$, $\mathbf{H}_{32}$, and
$\mathbf{H}_{43}$ in Fig. 10).


Fig. 9 Image rectification scheme


4.1  Direct  approach  for  depth  recovery 

Subsequent  to  rectification,  to  recover  depth,  we  first  determine  the  corresponding  location 
in  the  rectified  image  plane  for  the  three  pairs  of  cameras  (Fig.  10).  We  wish  to  recover  the 
3D  location  q  of  the  point  corresponding  to  Uq.  q  can  be  specified  in  the  following  form: 

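\[
\mathbf{q} = \mathbf{c}_1 + \lambda \mathbf{d}
\]

where $\mathbf{c}_1$ is the optical center of the first (“reference”) camera, $\mathbf{d}$ is the unit vector in the
direction from $\mathbf{c}_1$ to $\mathbf{q}$, and $\lambda$ is the depth of $\mathbf{q}$ from the reference camera optical center. If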
Fig. 10 Recovering depth from multibaseline stereo after rectification
since $\mathbf{P}_1 \tilde{\mathbf{c}}_1 = [0\ 0\ 0]^T$. So
To find the disparity, $\Delta_j = x'_j - x_j$, as a function of the projection transform elements, we first
find the expressions for the rectified image coordinates (noting that $y_j = y'_j$):
By varying $\lambda$ within a specified interval and resolution, we can calculate the $\Delta_j$'s for the pairs of
rectified images, and hence calculate the sum of matching errors (as in [13] with multiple
parallel cameras). The depth is recovered by picking the value of $\lambda$ associated with the least
matching error.

4.2  A  computationally  more  efficient  approach  for  depth  recovery 

The  method  described  above  implies  that  we  must  calculate,  at  each  point  and  for  each 
depth,  the  corresponding  points  in  all  images.  This  requires  projective  transformations  of  all 
images  to  be  performed  for  each  depth  value.  There  is  a  more  computationally  efficient  way 
to  recover  depth.  This  stems  from  the  following  properties: 

1.  The  two  rectified  planes  fall  on  the  same  plane. 

2.  The  line  joining  the  two  projection  centers  is  parallel  to  this  common  plane. 

Properties  1  &  2  (which  are  the  necessary  conditions  for  rectification)  give  rise  to 

3.  The  homography  between  the  two  rectified  planes  cannot  be  projective  (since  the  scan 
lines  on  the  rectified  images  are  parallel,  i.e.,  the  corresponding  rows  at  both  rectified 
images  are  equal).  This  is  true  since  the  “projection”  lines  (the  corresponding  scan  lines) 
meet  at  infinity. 
From 3, the homography between rectified planes must then be at most a 2D affine
transform, i.e., the last row of the homography matrix must be (0 0 1). This dispenses with the
additional division by the z-component in calculating the corresponding matched point for a
particular depth.


The scheme now follows that in Fig. 11. The matching is done using the homographies
between rectified images $\mathbf{K}_1$, $\mathbf{K}_2$ and $\mathbf{K}_3$ (which we term rectified homographies). The
rectified homographies can be readily determined as follows:


For a known depth plane ($z = d$), we can “contract” the 3x4 perspective matrix $\mathbf{M}$ (to the
rectified plane) to a 3x3 homography $\mathbf{G}$. For camera $l$, we have
where $\mathbf{p}_j$ is the $j$th column of $\mathbf{M}_l$ and $(u_l, v_l)^T$ is the projected image point in camera $l$.
Similarly, for camera $m$,
Since the rectified planes are coplanar, $s_l = s_m$; hence
Note that, due to rectification, $v_m = v_l$, and as explained earlier in this subsection, the bottom
row of $\mathbf{K}_{lm}$ is (0 0 1). In other words, the projective transformations are reduced to affine
transformations, reducing the amount of computation.
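
A small numerical sketch illustrates why the rectified homography is affine. It assumes that the contraction for the plane $z = d$ takes the usual form $\mathbf{G} = [\mathbf{p}_1 \;\; \mathbf{p}_2 \;\; d\,\mathbf{p}_3 + \mathbf{p}_4]$ and that the rectified homography is $\mathbf{K}_{lm} = \mathbf{G}_m \mathbf{G}_l^{-1}$; neither expression survives in the text above, so both are stated here as assumptions. The function names and matrix entries are hypothetical, chosen so that the two rectified transforms share their second and third rows, as required by rectification.

```python
import numpy as np

def contract(M, d):
    """3x3 homography taking plane coordinates (x, y, 1) on z = d to the image of M."""
    return np.column_stack([M[:, 0], M[:, 1], d * M[:, 2] + M[:, 3]])

def rectified_homography(M_l, M_m, d):
    """Homography from the rectified image of camera l to that of camera m for depth d."""
    return contract(M_m, d) @ np.linalg.inv(contract(M_l, d))

# Hypothetical rectified transforms: rows 2 and 3 are shared, and m34 = n34 = 1.
shared_rows = np.array([[0.0, 1.1, 0.1, 0.2],
                        [0.05, -0.02, 0.5, 1.0]])
M_l = np.vstack([[1.0, 0.1, 0.2, 0.3], shared_rows])
M_m = np.vstack([[0.9, 0.0, 0.3, -0.4], shared_rows])

K = rectified_homography(M_l, M_m, d=2.0)
```

Only the first row of `K` carries information, so mapping a pixel between rectified images costs one dot product and no perspective division.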

Depth  recovery  then  proceeds  in  a  similar  manner  as  the  direct  approach  described  in  the 
previous  subsection. 

4.3  An  approximate  depth  recovery  approach 

Fig. 11 A computationally more efficient depth recovery scheme

In both approaches described earlier, for each depth, each pixel in the unrectified reference
image has to be mapped $N_{cameras} - 1$ times to the respective rectified images (corresponding
to the homographies $\mathbf{H}_{11}$, $\mathbf{H}_{12}$, and $\mathbf{H}_{13}$ in Fig. 11). We can work in the rectified image
coordinates (say $\mathbf{M}_1$), but this still requires mapping from $\mathbf{M}_2$ to $\mathbf{M}_1$ and $\mathbf{M}_3$ to $\mathbf{M}_1$ in the
collection of match errors for each depth value. This means that we need to perform
$(N_{cameras} - 2) N_{depth}$ bilinear interpolations associated with image warping (where
$N_{depth}$ is the number of depth values and $N_{cameras}$ is the number of cameras).

In order to avoid the warping operations, we use an approximate depth recovery method.
The matching is done with respect to the rectified image of the first pair. However, the
rectified images $\mathbf{N}_2$ and $\mathbf{N}_3$ will not be row preserved relative to $\mathbf{M}_1$ (Fig. 12). We warp rectified
images $\mathbf{N}_2$ and $\mathbf{N}_3$ so as to preserve the rows as much as possible, resulting in $\mathbf{N}'_2$ and $\mathbf{N}'_3$
(Fig. 12). The errors should be tolerably small as long as the vergence angles are small. In
addition, this effect should not pose a significant problem as we are using a local windowing
technique in calculating the match error.
Fig. 12 The approximate depth recovery scheme (compare this with Fig. 11)


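By comparing Fig. 12 with Fig. 11, we can see that the mapping from $\mathbf{M}_1$ to $\mathbf{N}_2$ is given by
the homography $\mathbf{L}_{12} = \mathbf{K}_{13} \mathbf{H}_{12} \mathbf{H}_{11}^{-1}$. Similarly, the mapping from $\mathbf{M}_1$ to $\mathbf{N}_3$ is given by
$\mathbf{L}_{13} = \mathbf{K}_{14} \mathbf{H}_{13} \mathbf{H}_{11}^{-1}$. The matrices $\mathbf{A}_2$ and $\mathbf{A}_3$ are constructed such that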
i.e., the resulting overall mapping is row preserving ($r$ and $c$ are the row and column,
respectively). In general, this would not be possible unless all the camera centers are colinear;
however, this is a good approximation for small vergence angles and approximately aligned
cameras. $\mathbf{A}_2$ and $\mathbf{A}_3$ are calculated from the following overconstrained relation using the
pseudoinverse calculation:
where $\mathbf{L}_{1j}^{\min}$ is associated with the minimum depth and $\mathbf{L}_{1j}^{\max}$ with the maximum depth,
$c_{\min}$ and $c_{\max}$ are the minimum and maximum values of the image column, and $r_{\min}$ and $r_{\max}$ are
the minimum and maximum values of the image row, respectively. $X_i$ ($i = 1, \ldots, 8$) are
don’t-care values. The symbol | is used to represent matrix augmentation.


This algorithm has been implemented in parallel using the Fx (parallel Fortran) language
developed at Carnegie Mellon [15]. Fx, a variant of High Performance Fortran with
optimizations for high-communication applications like signal and image processing, runs on the
Carnegie Mellon-Intel Corporation iWarp, the Paragon XP/S, the Cray T3D, and the IBM
SP2. The experiments reported in this paper were done on the iWarp.


5  Experimental  results 

In this section, we present results of our active multibaseline stereo system. As mentioned
before, a pattern of sinusoidally varying intensity is projected onto the scenes to facilitate
image point correspondence.

An example of a set of images (Scene 1) and the extracted depth image are shown in Fig. 13
and Fig. 14, respectively. The large peaks at the borders of the depth map are outliers due to
mismatches in the background outside the depth range of interest.


Fig. 13 Views of the globe (Scene 1) from the four cameras ((a)-(d))


Another  example  (Scene  2)  is  shown  in  Fig.  15  with  the  recovered  elevation  map  in  Fig.  16. 
As  can  be  seen  from  the  elevation  map,  except  at  the  edges  of  the  objects  on  the  scene,  the 
data  looks  very  reasonable. 
Fig. 15 Views of Scene 2


Fig.  16  Elevation  map  of  Scene  2 


For  Scene  3  (Fig.  17),  subsequent  to  depth  recovery  (Fig.  18),  we  fit  the  known  models  onto 
the  range  data  using  Wheeler  and  Ikeuchi’s  3D  template  matching  algorithm  [18]  to  yield 
results  seen  in  Fig.  19.  Again  the  data  looks  very  reasonable. 


Fig. 17 The four camera views of Scene 3 ((a)-(d))


Fig.  18  Extracted  elevation  map  of  Scene  3 


We have also performed some error analysis on some of the range data that were extracted
from Scene 2. Fig. 20 shows the areas for planar fit; Table 1 shows the numerical results of
the planar fit. As can be seen, the average planar fit error is smaller than 1 mm (the furthest
planar patch is about 1.7 m away from the camera system). Fig. 21 depicts the error
distribution of the resulting planar fit across the image (only on pixels on planar surfaces in the
scene). The darker pixels are associated with lower absolute error in planar fitting.
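
The planar fit error can be computed along the following lines. This is a sketch under stated assumptions: the fitting procedure is not specified in the text, so an orthogonal least-squares plane fit via the singular value decomposition is assumed here, and the synthetic patch (its slope, distance, and noise level) is hypothetical rather than measured data.

```python
import numpy as np

def plane_fit_errors(points):
    """Fit a plane to an (N, 3) array of range points by orthogonal least squares
    and return the absolute distance of every point from the fitted plane."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]                      # direction of least variance
    return np.abs((points - centroid) @ normal)

# Synthetic planar patch about 1.7 m away (units: mm), with and without noise.
rng = np.random.default_rng(0)
xy = rng.uniform(-100.0, 100.0, size=(2000, 2))
z_plane = 1700.0 + 0.1 * xy[:, 0] - 0.05 * xy[:, 1]
clean = np.column_stack([xy, z_plane])
noisy = np.column_stack([xy, z_plane + rng.normal(0.0, 0.5, size=2000)])

clean_err = plane_fit_errors(clean)
noisy_err = plane_fit_errors(noisy)
mean_abs_error_mm = noisy_err.mean()     # analogous to an average |error| entry
```

The mean, maximum, and standard deviation of the returned distances correspond to the kinds of statistics reported for each patch.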

We  have  also  obtained  stereo  range  data  of  a  cylinder  of  known  cross-sectional  radius  and 
calculated  the  fit  error.  In  both  scenes  (with  different  camera  settings),  the  cylinder  is  placed 
about  3.3  m  away  from  the  camera  system. 

As  can  be  seen  from  Table  2,  the  mean  absolute  error  of  fit  is  less  than  1  mm. 

6  Observations  on  accuracy 

We have achieved an average accuracy of better than one millimeter. Here we informally
characterize the remaining sources of error in our system.
Fig. 19 Recovered 3D points of Scene 3 with fitted cylinder and box models (shown at four
different viewpoints)


Fig.  20  Sampled  areas  for  planar  fit. 


Table 1 Results of fitting planes to selected patches in Scene 2.


Fig. 21 Plane fit error distribution for Scene 2 (enhanced, planar surfaces only)


Fig. 22 Four camera views of the first cylinder scene


Table 2 Results of fitting cylinders

Cylinder scene #   Patch size (pixels)   Average |error| (µm)   Maximum |error| (mm)   Standard deviation (µm)
1                  25200                 640                    4.35                   540
2                  35150                 640                    3.17                   500


There are a number of sources of error in our system and in stereo generally:

1. The use of an active multibaseline approach reduces the chance of false matches, but they
can still occur.

2. The fundamental assumptions of stereo are that the texture being viewed is unique over
the search window, and that the surface is visible to and lies at the same angle to all
camera optical axes. The former assumption is addressed by the active component of our
system, but the latter is not and cannot be, except by placing the cameras as close together as
possible (which reduces accuracy). The failure of this assumption is particularly evident
at the boundaries of objects, where it is the cause of significant error.

3.  Errors  are  possible  during  calibration,  since  the  position  of  our  calibration  plate  is 
adjusted  by  hand  (limiting  its  accuracy  in  positioning  to  about  1  mm),  and  the  dot  pattern 
positions  are  not  always  found  precisely. 

4.  We  use  a  pinhole  camera  model,  which  will  result  in  errors  near  the  edge  of  the  image, 
particularly  with  short  focal  lengths. 

5.  We  make  the  approximation  discussed  in  Section  4.3,  which  will  result  in  errors  when  the 
camera  optical  centers  are  not  colinear. 

Of  these,  only  the  first  seems  to  be  a  cause  of  significant  error  (the  second  also  causes  large 
error,  but  we  deliberately  omit  it  from  our  error  analysis  since  it  is  fundamental  to  stereo). 
All  of  the  large  errors  (more  than  1  mm)  are  observed  to  be  in  regions  where  the  projected 
pattern  does  not  provide  sufficient  texture  for  a  correct  match. 

We have attempted to reduce these errors by analysis and experimentation. Analysis shows
that a frequency-modulated sine wave pattern, as used here, is a good choice since it does
not require large dynamic range (our iWarp video interface has manually adjustable gain
and offset controls, leading us to limit the dynamic range to avoid clipping). Also, a
randomly frequency-modulated sine wave gives the best possible result, since the same pattern
occurs twice in the search area with vanishingly small probability, theoretically eliminating
the possibility of false matches. Experiments with randomly modulated patterns have shown
that

•  The lowest frequency of the sine wave (as seen in the image) must be high enough that its
period is shorter than the width of the correlation match window.

•  The highest frequency usable is constrained by the resolution of the camera and the focus control
of the projector. Using a higher frequency than the maximum results in a gray blur and many false
matches.
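
A randomly frequency-modulated sinusoid bounded by these two constraints can be generated as in the sketch below. It is an illustration only: the projected pattern actually used is not specified beyond the description above. The sketch assumes a 1-D pattern whose local frequency is drawn at random at evenly spaced knots and interpolated linearly in between; the function name, the period bounds, and the knot spacing are hypothetical, and the periods are in pattern pixels rather than camera pixels.

```python
import numpy as np

def fm_sine_pattern(width, min_period, max_period, knot_spacing=16, seed=0):
    """Return (intensity, frequency) for a 1-D randomly frequency-modulated sinusoid.

    min_period stands in for the resolution and focus limit (highest usable frequency);
    max_period stands in for the match-window limit (lowest usable frequency)."""
    rng = np.random.default_rng(seed)
    knots = np.arange(0, width + knot_spacing, knot_spacing)
    knot_freq = rng.uniform(1.0 / max_period, 1.0 / min_period, size=knots.size)
    freq = np.interp(np.arange(width), knots, knot_freq)   # cycles per pixel
    phase = 2.0 * np.pi * np.cumsum(freq)                  # integrate frequency to get phase
    intensity = 0.5 + 0.5 * np.sin(phase)                  # stays within [0, 1]
    return intensity, freq

pattern, freq = fm_sine_pattern(width=512, min_period=4.0, max_period=12.0)
```

Keeping the intensity within a fixed range reflects the point made above that the pattern does not need a large dynamic range.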

The  trade-off  between  these  two  constraints  involves  optimizing  the  projector  placement 
and  focus,  the  camera  resolution,  the  number  of  cameras,  and  the  camera  dynamic  range. 

In addition, many of the problems of false matches occur where the limited dynamic range
of our video interface plays a role, particularly with dark surfaces or surfaces which lie at an
oblique angle to the projector (so that no pattern appears in the image), or surfaces with
specularities (so that clipping overwhelms the pattern). In these cases, we believe careful
adjustment of the projector, including use of multiple projectors (since there is no particular
constraint between the projector and camera in active stereo, this is easy to do), can serve to
reduce these effects. The use of multiple patterns, either time-sequenced (taking advantage
of our system’s ability to capture images at high speed) or color-sequenced (using color
cameras) is also promising.
7 Summary

We have briefly described a 4-camera system that is capable of video rate image acquisition.
It uses a software distribution approach which takes advantage of iWarp’s systolic design.
The four cameras are used in a converging configuration for more effective use of the
camera view spaces. In addition, to recover dense stereo range data from each set of images, we
project a sinusoidally varying pattern onto the scene to enhance local intensity
discriminability. This results in the notion of an active multibaseline stereo system.

We  have  also  described  in  detail  our  implementation  of  the  depth  recovery  algorithm  which 
involves  the  preprocessing  stage  of  image  rectification.  Our  approximate  depth  recovery 
implementation  was  designed  for  reduced  computation. 

The results that we have obtained from this system indicate that the mean errors (discounting
object border areas) are less than a millimeter at distances varying from 1.5 m to 3.5 m
from the camera system. The performance of the system is thus comparable to a good
structured light system, while allowing data to be captured at full video rate.

Active  multibaseline  stereo  appears  to  be  a  promising  addition  to  structured  light  imaging 
systems.  It  allows  images  to  be  captured  at  high  speed  and  still  have  high  spatial  resolution. 
It  allows  great  freedom  in  the  relationship  between  the  camera,  the  surface,  and  the  light 
source,  making  it  possible  to  manipulate  these  so  as  to  get  high  accuracy  in  a  wide  variety  of 
circumstances. 